A formalism specifying efficient, "emergent" descriptions of experimental systems 
is developed. It does not depend on an a priori assumption of limited available data. 



1.1 Introduction 

A complex system can become an economic problem. Understanding its internal
machinery, describing it, and predicting its future behaviour can be expensive.
The problem of finding simple, accurate, and efficient descriptions is a
central aspect of the work on complex systems. Perhaps it is the unifying aspect
of complex-systems science.

Interestingly, this practical problem is closely related to the philosophical
problem of emergence. Stated in its weakest form, this is the question
why, if the basic laws of physics are so simple, the world around us appears to
have such a rich structure. A partial answer that easily comes to mind is this: If
we tried to apply the basic laws every time we interpret the world around
us, it would just take too much time. Instead we use other descriptions
that are more efficient. But each applies only to a particular part of the world,
so we need many of them. In the language of computer science, we are trading
computation time for description length. Apparently, this is a good deal. The
structure of the world as we see it is a result of solving just the economic problem
mentioned above. We are reducing the cost of describing the complex system
"world".

This is only a partial answer to the problem of emergence. Many questions 
remain unanswered, such as, "Why are there distinct parts for which efficient 
descriptions exist?" or "Can efficient descriptions be found systematically, and, 
if yes, how?". But it is this partial answer that will be of interest here, for it is 
itself incomplete. 

Efficient, simplified descriptions are rarely perfectly precise, and somehow a
decision has to be made which information about the thing described the
description should reproduce, and which may be ignored. The conventional
strategy at this point of the problem is to presuppose that the information
regarding the thing described is incomplete anyway, and that only the available
information must be reproduced. This blurring of the picture comes under many
different names: finite samples of noisy data, coarse graining, partitioning of
the state space, etc. As a result, the choice of the simplified description
becomes essentially a function of the mode of observation. But does this
correspond to the facts? The history of science knows many examples of
simplified descriptions (and related concepts) that were introduced long before
the things described could be observed. Obvious examples are descriptions in
terms of quasi-particles such as "holes" and "phonons" used in solid state
physics. On the other hand, descriptions that are much coarser than any
reasonable limit of observation are also frequently used. One might just think
of a description of traffic flow in terms of atomic "cars".

Shalizi and Moore [23] suggested a solution to this problem based on causal
states. Here, a different argument for reducing the information to be
reproduced by a description is explored. Information regarding the thing described is
dropped not because it is unavailable, but for the sake of an efficient and simple
description. Central to this argument is the distinction between two kinds of
descriptions: models, which produce data somehow similar to present or future
real data, and characterizations, which summarize some aspects of data.

Predictions about complex systems generally require both: a model that is 
used for the prediction, and a characterization that specifies what aspects of 
the real data the model is supposed to reproduce. By the condition that model 
and characterization are both simple and efficient, particular choices for the 
information to be retained by the descriptions are singled out. This part of the 
information is "relevant" for a simple reason: it can be predicted within given 
cost constraints. 

In the remainder of this work, it is shown that this approach can be taken
beyond hand-waving. Formal definitions of basic notions are introduced.
Desiderata for economic descriptions are summarized under the notion of basic
model-specifying characterizations (b.m.s.c.), and it is shown that nontrivial
b.m.s.c. exist. They are far from unique. The accuracy and detail of preferred
descriptions depend on the available resources, and the formalism takes this
into account. Results are illustrated by a minimal example.

1.2 The formalism 

For the formal analysis, both models and characterizations are represented by
computer programs. The complex system to be described is represented by
a computer-controlled experiment. Fig. 1.1 illustrates the interaction between
experimenter (the "Control Parameter" terminal), experiment, model, and
characterization. A characterization of data is given by a statement saying that
the data passes a certain test, in general a statistical test.

Figure 1.1: (a) Generic setup of a computer-controlled experiment, (b) Data flow in
a test of a computational model.

Throughout the theory, assume a control parameter format $C \subset \{0,1\}^n$
and a data format $D \subset \{0,1\}^m$ to be fixed, with $\{0,1\}^k$ denoting the set of
all binary strings of length $k$ and $n, m \in \mathbb{N}_0$. Given a control parameter value
$x \in C$ and being run, the experiment (including the D/A and A/D conversion)
produces an output value $y \in D$. Input and output data can be sets of numbers,
images, time series, etc. The only major limitation is that both $C$ and $D$ are
finite sets. The A/D conversion of the experimental output naturally involves
some loss of information. But below it is argued that the information passing
through the A/D converter can be much richer than the information tested for
and reproduced in the model. The information loss at the A/D converter
is not decisive for determining the "emergent" description.

In general, the complex system involved in the experiment is not deterministic.
The experimental output $y$ is a realization of a random variable $Y$ with
values in $D$. The experiment is assumed reproducible in the sense that repeated
runs of the experiment (with identical $x$) yield a sequence $Y_1, Y_2, \ldots$ of
statistically independent, identically distributed (i.i.d.) results.

Definition 1 For a given (deterministic) machine model, a test t is a program
that takes a control parameter $x \in C$ as input, runs, and then halts with output
0, 1, or e. When the output is not e, the test can request several data samples
before halting ("rerun" in Fig. 1.1). Then execution of the test is suspended
until a sample $y \in D$ is written into a dedicated storage accessible by the test.
The number of samples requested can depend on the sampled $y$ but is finite for
any sequence of successive samples.
By the output e the test t indicates that $x$ is not within the range of validity
$C[t] := \{x \in C \mid \text{output of } t \text{ with input } x \text{ is not e}\}$ of the corresponding
characterization. The outputs 1 or 0 indicate that the null hypothesis (see below) is
accepted or rejected by the test, respectively.
Models are represented by generators.

Definition 2 Given a machine model, a generator g is a program that takes a
control parameter $x \in C$ as input, runs, outputs data $y \in D$ and halts. The
program has access to a source of independent, evenly distributed random bits
in an otherwise deterministic machine.

Now a cost function is introduced which measures the cost involved in running
models g and tests t, constructing and evaluating them, and performing
experiments. We assume that this cost can be expressed in terms of the
lengths $L(t), L(g) \in \mathbb{N}_0$ of the programs t and g, their average execution times
$T(g), T(t) \in \mathbb{R}^{\ge 0}$, and the average number $N(t) \in \mathbb{R}^{\ge 0}$ of experimental runs
required by t. To be specific, define $T(\cdot)$ as the maximum of the expectation
value of the runtime over all $x \in C$ and all distributions of input data, and $N(\cdot)$
analogously. It can be shown that $T(t)$ and $N(t)$ are always finite. As is
conventional, the number of tests or generators q with $L(q) < n$ is assumed to be
finite for all $n \in \mathbb{N}$.

Definition 3 A cost function K is a mapping $K : \mathbb{N}_0 \times \mathbb{R}^{\ge 0} \to \mathbb{R}^{\ge 0}$ or
$K : \mathbb{N}_0 \times \mathbb{R}^{\ge 0} \times \mathbb{R}^{\ge 0} \to \mathbb{R}^{\ge 0}$ that increases strictly monotonically in all its
arguments. The abbreviation $K(t)$ stands for $K[L(t), T(t), N(t)]$ if t is a test
and $K(g)$ stands for $K[L(g), T(g)]$ if g is a generator.

In practice, the cost of descriptions depends strongly on the circumstances. The
theory should therefore be independent of the particular choice of the cost
function. For this purpose, as is made clear by Theorem 3 below, the following
definition is convenient.

Definition 4 Let $p_1$ and $p_2$ be two tests or two generators. Then the relations
$\preceq$ (always cheaper or equal) and $\prec$ (always cheaper) are defined by

$p_1 \preceq p_2 \;\Leftrightarrow\; L(p_1) \le L(p_2) \text{ and } T(p_1) \le T(p_2) \text{ and } N(p_1) \le N(p_2)$ (1.1)

(for generators without the last condition) and

$p_1 \prec p_2 \;\Leftrightarrow\; p_1 \preceq p_2 \text{ and not } p_2 \preceq p_1.$ (1.2)

A test or generator p is said to be $\preceq$-minimal in a set P of tests or generators
if $p \in P$ and there is no $p' \in P$ such that $p' \prec p$.
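
The relations of Definition 4 amount to a componentwise comparison of resource triples, and the minimal elements form what is elsewhere called a Pareto front. A minimal Python sketch follows, assuming each program is summarized by a hypothetical record `Cost(L, T, N)`; the numerical values are made up for illustration and are not taken from the text.

```python
from typing import Iterable, List, NamedTuple

class Cost(NamedTuple):
    """Resource summary of a test or generator (illustrative encoding)."""
    L: int          # program length
    T: float        # average execution time
    N: float = 0.0  # average number of experimental runs; left at 0 for generators

def cheaper_or_equal(p1: Cost, p2: Cost) -> bool:
    """Relation (1.1): p1 uses no more of any resource than p2."""
    return p1.L <= p2.L and p1.T <= p2.T and p1.N <= p2.N

def always_cheaper(p1: Cost, p2: Cost) -> bool:
    """Relation (1.2): cheaper or equal, but not the other way round."""
    return cheaper_or_equal(p1, p2) and not cheaper_or_equal(p2, p1)

def minimal_elements(programs: Iterable[Cost]) -> List[Cost]:
    """All p in the set for which no p' in the set is always cheaper than p."""
    ps = list(programs)
    return [p for p in ps if not any(always_cheaper(q, p) for q in ps)]

# Hypothetical generators given as (length, time).
P = [Cost(10, 100.0), Cost(20, 50.0), Cost(20, 120.0), Cost(30, 50.0)]
print(minimal_elements(P))
```

In this made-up set the first two entries are minimal: one is shorter, the other faster, and neither is always cheaper than the other. The remaining two are each beaten in one resource without any gain in the other.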

Lemma 1 Relation $\preceq$ is transitive and reflexive, relation $\prec$ is transitive and
antireflexive.

(Since $\preceq$ is not antisymmetric, it is not a partial order.) The proof is standard.
Lemma 2 For any two tests or generators $p_1$, $p_2$, and any cost function K,
$p_1 \prec p_2$ implies $K(p_1) < K(p_2)$.

Proof Assume that $p_1$ and $p_2$ are generators. Then $L(p_1) \le L(p_2)$ and
$T(p_1) \le T(p_2)$ and either $L(p_1) < L(p_2)$ or $T(p_1) < T(p_2)$, since if both
were equal the last part of condition (1.2) would be violated. Thus, using
the strict monotony of K, one has either
$K[L(p_1),T(p_1)] < K[L(p_2),T(p_1)] \le K[L(p_2),T(p_2)]$ or
$K[L(p_1),T(p_1)] \le K[L(p_2),T(p_1)] < K[L(p_2),T(p_2)]$. Both
imply $K(p_1) < K(p_2)$. For tests the proof is analogous. ■

Theorem 3 Let P be a set of tests or generators. An element $p \in P$ is
$\preceq$-minimal in P if and only if there is a cost function K that attains its
minimum over P at p.

Proof The "if" part: If some K attained its minimum over P at p but p
were not $\preceq$-minimal, there would be a $p' \in P$ such that $p' \prec p$ and, by Lemma 2,
$K(p') < K(p)$. But this contradicts the premise. So p is $\preceq$-minimal.

The "only if" part: Assume p is $\preceq$-minimal in a set of generators P. We show
that there is a cost function that attains its minimum over P at p by explicit
construction. $K(l,t) := \kappa(l, L(p)) + \kappa(t, T(p))$ with $\kappa(z, z_0) = z$ for $z \le z_0$ and
$\kappa(z, z_0) = L(p) + T(p) + z$ for $z > z_0$ does the job. Obviously K satisfies strict
monotony. And any $p' \in P$ that does not have $L(p') = L(p)$ and $T(p') = T(p)$
[and hence $K(p') = K(p)$] must have either a larger L or a larger T than p,
otherwise p would not be $\preceq$-minimal. But then $K(p') > L(p) + T(p) = K(p)$.
So $K(p)$ is the minimum of K over P. For tests the proof is analogous. ■
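
The explicit construction in the "only if" part can be checked numerically. The sketch below assumes generators are given as hypothetical (length, time) pairs with made-up values; `make_cost_function` is an illustrative name, not part of the formalism.

```python
from typing import Callable, List, Tuple

def make_cost_function(L_p: float, T_p: float) -> Callable[[float, float], float]:
    """Build K(l, t) = kappa(l, L_p) + kappa(t, T_p) as in the proof of Theorem 3."""
    def kappa(z: float, z0: float) -> float:
        # Identity up to z0, then a jump by L_p + T_p; strictly increasing in z.
        return z if z <= z0 else L_p + T_p + z

    def K(l: float, t: float) -> float:
        return kappa(l, L_p) + kappa(t, T_p)

    return K

# Hypothetical set of generators as (length, time) pairs.
P: List[Tuple[int, float]] = [(10, 100.0), (20, 50.0), (20, 120.0), (30, 50.0)]

p = (20, 50.0)                 # a minimal element of P
K = make_cost_function(*p)
cheapest = min(P, key=lambda q: K(*q))
print(cheapest, K(*cheapest))  # prints (20, 50.0) 70.0
```

The jump in `kappa` is what penalizes every program that exceeds p in some resource by more than p's total cost, so a saving in the other resource can never compensate. Building K around a non-minimal element, such as (30, 50.0) in this set, does not place the minimum there.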

Lemma 4 Every nonempty set P of tests or generators contains an element p
which is $\preceq$-minimal in P.

Proof Assume that P has no $\preceq$-minimal element. Then for every element
$p \in P$ there is a $p' \in P$ such that $p' \prec p$. Thus an infinite sequence of successively
always-cheaper ($\prec$) elements of P can be constructed. Because $\prec$ is transitive
and antireflexive, such a sequence contains each element at most once. Let
q be the first element of such a sequence. Since by definition $p \prec q$ implies
$L(p) \le L(q)$, and there is only a finite number of programs p with $L(p) \le L(q)$,
the number of successors of q cannot be infinite. So the premise that P has no
$\preceq$-minimal element is wrong for any nonempty P. ■

The $\preceq$-minimal element is generally not unique. Different $\preceq$-minima minimize
cost functions that give different weight to the resources used: length, time, and
experimental runs. On the other hand, it turns out that in practice the
machine dependence of relation $\prec$ for implementations of algorithms on different
processor models is weak. Therefore, instead of cost functions, relation $\prec$ is
used below.

A central element of statistical test theory is the power function. It is
defined as the probability that the test rejects data of a given (usually
parameterized) distribution. The goal of statistical test theory is to find tests
whose power function is below a given significance level $\alpha$ if the null
hypothesis is satisfied, and as large as possible otherwise.

Denote by the test function $t_x(\{y_i\})$ the output of the test t at control
parameter $x \in C[t]$ when applied to the sequence of experimental results
$\{y_i\} \in D^\infty$ (for formal simplicity, the sequences $\{y_i\}$ are assumed infinite,
even though tests use only finite subsequences).

Definition 5 For any test t, the power of the test function $t_x$, when applied
to the random sequence $\{Y_i\}$ with values in $D^\infty$, is the probability of rejecting
$\{Y_i\}$, i.e.,

$\mathrm{pow}(t_x, \{Y_i\}) := \Pr[t_x(\{Y_i\}) = 0] \quad (x \in C[t]).$ (1.3)

Unlike in conventional test theory, there is no independent null hypothesis $H_0$
here that states the distribution or the class of distributions of $\{Y_i\}$ that is
tested for. Instead, given a test function $t_x$, the null hypothesis, i.e., the class
of distributions, is defined by the condition

$\mathrm{pow}(t_x, \{Y_i\}) < \alpha,$ (1.4)

where $0 < \alpha < 1$ is a fixed significance level (see footnote 1).
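
The power (1.3) and the condition (1.4) can be explored by simulation. The following Python sketch estimates the power of a simple test by Monte Carlo sampling. Everything in it is an illustrative assumption: the helper names `estimate_power`, `both_values_test`, and `biased_bit`, the choice of test (reject unless both bit values occur among five samples), and the value 0.1 used for the significance level.

```python
import random
from typing import Callable

def both_values_test(rerun: Callable[[], int], n_samples: int = 5) -> int:
    """Illustrative test: accept (1) if both bit values occur among n_samples, else reject (0)."""
    seen = {rerun() for _ in range(n_samples)}
    return 1 if seen == {0, 1} else 0

def biased_bit(p0: float) -> Callable[[random.Random], int]:
    """Data source that yields 0 with probability p0 and 1 otherwise."""
    return lambda rng: 0 if rng.random() < p0 else 1

def estimate_power(test, source, runs: int = 20000, seed: int = 0) -> float:
    """Fraction of independent repetitions in which the test outputs 0 (rejects)."""
    rng = random.Random(seed)
    rejections = sum(1 for _ in range(runs) if test(lambda: source(rng)) == 0)
    return rejections / runs

alpha = 0.1  # assumed significance level
for p0 in (0.5, 0.52, 1.0):
    power = estimate_power(both_values_test, biased_bit(p0))
    print(p0, power, "in null hypothesis" if power < alpha else "outside null hypothesis")
```

Because the null hypothesis is defined through the test itself, the loop simply asks for which data distributions the estimated rejection probability stays below the assumed level. A simulation only approximates the power; for a test this simple an exact formula is available.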

Now the concepts from statistics and computer science introduced above are
combined. Denote by $g_x$ the sequence $\{Y_i\}$ of random outputs of generator g at
control parameter $x$.

Definition 6 A generator g is an optimal generator relative to a test t and
a power threshold $1 > \gamma > \alpha$ (notation: $\mathrm{opt}_t^{\gamma}\, g$) if

1. $\mathrm{pow}(t_x, g_x) < \alpha$ for all $x \in C[t]$ and

2. for every generator $g' \prec g$ there is an $x \in C[t]$ such that $\mathrm{pow}(t_x, g'_x) > \gamma$.

This implies that g is $\preceq$-minimal in $\{g' \mid \mathrm{pow}(t_x, g'_x) < \alpha \text{ for all } x \in C[t]\}$. Hence
g is, for some cost function, the minimal(-cost) model for the property that t
is testing for. Condition 2 can be satisfied only for particular choices of t. It
requires a minimal power $\gamma$ from t to distinguish the models that it characterizes
from those it does not. Constructing tests that maximize $\gamma$ leads to results
similar to the locally most powerful tests of statistical test theory.

For an i.i.d. random sequence $\{Y_i\}$ denote by $p[\{Y_i\}]$ the distribution function
of its elements, i.e., $p[\{Y_i\}](y) := \Pr[Y_i = y]$ for $y \in D$.

Definition 7 Call a generator g an optimal implementation with respect
to a set $C' \subset C$ if it is $\preceq$-minimal in $\{g' \mid p[g'_x] = p[g_x] \text{ for all } x \in C'\}$ (the set of
generators that do exactly the same).

Theorem 5 For every $C' \subset C$, every optimal implementation g with respect to
$C'$, and every $1 > \gamma > \alpha$ there is a test t such that $\mathrm{opt}_t^{\gamma}\, g$ and $C[t] = C'$.

Footnote 1: From $t_x$, tests for the same $H_0$ at other significance levels can be constructed.
Proof An explicit construction of t is outlined: $x \in C'$ can be tested for by
keeping a list of $C'$ in t. Since there is only a finite number of $g' \preceq g$, the test
must distinguish $p[g_x]$ from a finite number of different distributions $p[g'_x]$ for all
$x \in C'$, with power $\gamma$. This can be achieved by comparing a sufficiently accurate
representation of $p[g_x]$, stored in t for all $x \in C'$, with a histogram obtained from
sufficiently many samples of $g'_x$. ■

Definition 8 Call a pair (t, g) a basic model-specifying characterization
(b.m.s.c.) if t is $\preceq$-minimal in $\{t' \mid \mathrm{opt}_{t'}^{\gamma}\, g \text{ and } C[t] \subset C[t']\}$ for some $1 > \gamma > \alpha$.

That is, for some cost function the test t gives the minimal characterization
required to specify g (given power threshold $\gamma$ and range of validity $C[t]$).
Sometimes there are other generators which are similar to g but cheaper. Then t
must be very specific to characterize the particularities of g. In other cases, the
output of g has an essentially new, "striking" property which cannot be obtained
with cheaper generators. If the property is really "striking", a rather cheap and
generic test t is sufficient to detect it. Thus t can ignore all other information
contained in the output of g. Such an approximate characterization is most likely
to apply also to the data of an actual experiment. Then the b.m.s.c. (t, g)
provides a specific but economic description. After verifying the b.m.s.c. for some
control parameters $x \in C[t]$, approximate predictions of experimental results for
other parameters can be obtained from g by the usual (though philosophically
opaque) method of induction.

A trivial b.m.s.c. is given by a test t that always outputs 1 and some generator
g that is $\preceq$-minimal among all generators. But the following makes clear that
the world of b.m.s.c. is much richer.

Theorem 6 There is, for every $C' \subset C$ and every optimal implementation g
with respect to $C'$, a test t such that (t, g) is a b.m.s.c. and $C' \subset C[t]$.

Proof Fix some $1 > \gamma > \alpha$. By Theorem 5 the set
$S := \{t' \mid \mathrm{opt}_{t'}^{\gamma}\, g \text{ and } C' \subset C[t']\}$ is nonempty. The theorem is satisfied
by any t which is $\preceq$-minimal in S. By Lemma 4 such an element exists. ■

1.3 A simple example 

As a minimal, analytically tractable example, consider an experiment without
control parameters in which only a single bit is measured, $D = \{0, 1\}$.
The probability p for the case $y = 0$ to occur is exactly $p = 0.52$, and the
"complexity" of the system consists just in this nontrivial value. With $\alpha = 0.1$,
the following pair (t, g) is a b.m.s.c.: a generator g [with $L(g) = 52$ byte and
$T(g) = 56\,\upsilon$ on the MMIX model processor; the unit of time $\upsilon$ reads "oops"]
that outputs $y = 0$ and $y = 1$ with exactly equal probability $p = 1/2$, and a test
t ($L(t) = 104$ byte and $T(t) = 255\,\upsilon$) that verifies if among $N = 5$ samples both
$y = 0$ and $y = 1$ occur at least once. This test is the cheapest test that accepts
the model g ($\mathrm{pow}(t, \{g\}) = 1/16 < \alpha$) and rejects all cheaper models, namely
generators g' that always output the same value [one finds $L(g') = 28$ byte,
$T(g') = 38\,\upsilon$, $\mathrm{pow}(t, \{g'\}) = 1 > \alpha$]. But t also characterizes all experiments
for which $\mathrm{pow}(t, \{Y_i\}) = p^N + (1 - p)^N < \alpha$, such as our case $p = 0.52$, where
$\mathrm{pow}(t, \{Y_i\}) \approx 0.064$.
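
The power values quoted for this test follow from the fact that it rejects exactly when all N samples coincide. A short Python check using exact rational arithmetic is sketched below; the function name `power_both_values` is hypothetical.

```python
from fractions import Fraction

def power_both_values(p0: Fraction, n_samples: int = 5) -> Fraction:
    """Rejection probability p^N + (1 - p)^N: all N i.i.d. bits are 0, or all are 1."""
    return p0 ** n_samples + (1 - p0) ** n_samples

alpha = Fraction(1, 10)
cases = [
    ("fair generator g", Fraction(1, 2)),
    ("constant generator g'", Fraction(1)),
    ("experiment with p = 0.52", Fraction(52, 100)),
]
for label, p0 in cases:
    power = power_both_values(p0)
    print(f"{label}: pow = {float(power):.4f}, below alpha: {power < alpha}")
```

For the fair generator the result is exactly 1/16, for a constant generator it is 1, and for p = 0.52 it is about 0.0635, which the text rounds to 0.064.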

There are other b.m.s.c. for the experiment. For example, a generator g*
that computes an 8-bit random integer in the range $0, \ldots, 2^8 - 1$, and uses it to
output $y = 0$ with probability $p = 133 \times 2^{-8} = 0.5195$ and $y = 1$ otherwise
[$L(g^*) = 76$ byte and $T(g^*) = 225\,\upsilon$]; and a test t* that verifies if within 962
samples between 437 and 487 cases $y = 1$ occur (the window is centered near
$962\,(1 - p) \approx 462$) [$L(t^*) = 112$ byte, $T(t^*) = 40430\,\upsilon$]. One finds
$\mathrm{pow}(t^*, \{g^*\}) = 0.099834 < \alpha = 0.1$ and $\mathrm{pow}(t^*, \{Y_i\}) = 0.099832 < \alpha$
for the experimental data. The next cheapest generators, which
have $p = 132 \times 2^{-8} = 0.5156$ or $p = 134 \times 2^{-8} = 0.5234$, and are faster because
they require only 6-bit or 7-bit random numbers respectively, are rejected with
a power larger than $\gamma = 0.108576 > \alpha$. A cheaper test could not reach this $\gamma$.
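
The powers of the counting test t* can be recomputed from the binomial distribution. The Python sketch below does so under two assumptions that are not stated explicitly in the text: the window 437 to 487 is taken as inclusive, and the counted outcome is the one occurring with probability 1 - p (the window is centered near 962 (1 - p)). The function names are hypothetical.

```python
from math import exp, lgamma, log

def binom_pmf(k: int, n: int, q: float) -> float:
    """Binomial probability of k successes in n trials, computed via logarithms."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(q) + (n - k) * log(1.0 - q))
    return exp(log_pmf)

def window_test_power(q: float, n: int = 962, lo: int = 437, hi: int = 487) -> float:
    """Rejection probability: the count of the watched outcome falls outside [lo, hi]."""
    accept = sum(binom_pmf(k, n, q) for k in range(lo, hi + 1))
    return 1.0 - accept

# Probability of the counted outcome for g* and its two cheaper neighbours,
# and for the experiment itself (p = 0.52).
for label, p in [("132/256", 132 / 256), ("133/256 (g*)", 133 / 256),
                 ("134/256", 134 / 256), ("experiment", 0.52)]:
    print(label, round(window_test_power(1.0 - p), 6))
```

If these assumptions match the intended test, the printed values should come out close to the powers quoted above: just below 0.1 for g* and for the experiment, and above 0.1 for the two neighbouring generators.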

One might think of g, g*, and some exact g** as a primitive form of different
levels of description for the same experiment.